We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
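The combination of objectives described above can be sketched as a single scalar loss: cross-entropy (next-token prediction) on discrete text tokens plus a weighted noise-prediction (diffusion) loss on continuous image patches. This is a minimal illustration only; the balancing coefficient `lam`, the function names, and the flat-list representation are assumptions for clarity, not the paper's actual implementation.

```python
import math

def lm_loss(logits, target):
    # Cross-entropy for one next-token prediction.
    # logits: list of unnormalized scores; target: index of the true token.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def diffusion_loss(predicted_noise, true_noise):
    # Mean squared error between the model's predicted noise and the
    # actual noise added to a continuous image patch.
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

def transfusion_loss(text_items, image_items, lam=5.0):
    # Sum the LM loss over text positions and the lambda-weighted
    # diffusion loss over image patches; one objective, one model.
    # lam is an assumed balancing hyperparameter.
    l_text = sum(lm_loss(logits, tgt) for logits, tgt in text_items)
    l_image = sum(diffusion_loss(p, t) for p, t in image_items)
    return l_text + lam * l_image
```

In a real model both terms are produced by the same transformer over one mixed-modality sequence; only the per-position loss differs by modality.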